Colombia COVID-19

LINK: https://www.kaggle.com/camesruiz/colombia-covid19-complete-dataset

DESCRIPTION: Coronavirus (COVID-19) made its outbreak in Colombia with the first confirmed case in the country on March 6th; since then, the number of confirmed cases has been increasing, and the first deaths related to the virus have been confirmed. This dataset contains complete information about confirmed cases, deaths and the number of recovered patients, according to the daily reports by the Colombian health department (Ministerio de Salud).

GOAL: Build a model for the number of confirmed cases in the different Colombian regions. You have access to some covariates, such as Edad (age), Sexo (sex) and País de procedencia (country of origin) of the individual cases. Try to test the predictive accuracy of your selected model.

ATTENTION: Three countries are considered here: Colombia, Mexico and India. Each group of students should focus on a geographical sub-area of one of the three countries, say the northern, central or southern part, by pooling all the regions/states/departments belonging to the considered area. For example: group A focuses on Northern Mexico, group B on Central Mexico, and so on. The distinction between northern, central and southern is not strict; you have some flexibility.


Our Project

We decided to work on central Colombia because it is where the capital is.

The largest cities in the country are Bogotá (in the center), Medellín (in the north, close to central), Cali (in the center) and Barranquilla (extreme north).

Dataset - First Exploration

##  [1] "Bogotá D.C."           "Valle del Cauca"      
##  [3] "Antioquia"             "Cartagena D.T. y C"   
##  [5] "Huila"                 "Meta"                 
##  [7] "Risaralda"             "Norte de Santander"   
##  [9] "Caldas"                "Cundinamarca"         
## [11] "Barranquilla D.E."     "Santander"            
## [13] "Quindío"               "Tolima"               
## [15] "Cauca"                 "Santa Marta D.T. y C."
## [17] "Cesar"                 "San Andrés"           
## [19] "Casanare"              "Nariño"               
## [21] "Atlántico"             "Boyacá"               
## [23] "Córdoba"               "Bolívar"              
## [25] "Sucre"                 "La Guajira"
## # A tibble: 26 x 2
## # Groups:   Departamento o Distrito [26]
##    `Departamento o Distrito`     n
##    <chr>                     <int>
##  1 Antioquia                   127
##  2 Atlántico                     4
##  3 Barranquilla D.E.            31
##  4 Bogotá D.C.                 542
##  5 Bolívar                       3
##  6 Boyacá                        6
##  7 Caldas                       16
##  8 Cartagena D.T. y C           39
##  9 Casanare                      2
## 10 Cauca                        12
## # ... with 16 more rows

Colombia is divided into 32 departments. According to Wikipedia, we are missing the departments of Amazonas, Arauca, Caquetá, Chocó, Guainía, Guaviare, Magdalena, Putumayo, Vaupés and Vichada.

Bogotá, Distrito Capital, is in the Cundinamarca Department. Barranquilla D.E. is a “Distrito Especial” but should be in the Atlántico Department.

The Districts (Spanish: Distrito) in Colombia are cities that have a feature that highlights them, such as their location and trade, history or tourism. Arguably, the districts are special municipalities. The districts are Bogotá, Barranquilla, Cartagena, Santa Marta, Cúcuta, Popayán, Tunja, Buenaventura, Turbo and Tumaco.

We are missing Cúcuta, Popayán, Tunja, Buenaventura, Turbo and Tumaco.

# lat-long bogota <- c(4.592164298, -74.072166378, 542)
valle_de_cauca <- c("Valle del Cauca", 3.43722, -76.5225, 150)
cauca <- c("Cauca", 2.43823, -76.6131592, 12)
antioquia <- c("Antioquia", 6.25184, -75.56359, 127)
cartagena <- c("Cartagena D.T. y C", 10.39972, -75.51444, 39)
huila <- c("Huila", 2.9273, -75.2818909, 30)
meta <- c("Meta", 4.142, -73.62664, 12)
risaralda <- c("Risaralda", 4.8133302, -75.6961136, 35)
norte_santander <- c("Norte de Santander", 7.89391, -72.50782, 21)
caldas <- c("Caldas", 5.06889, -75.51738, 16)
cudinamarca <- c("Cundinamarca", 4.862437, -74.058655, 42)
barraquilla <- c("Barranquilla D.E.", 10.96854, -74.78132, 35)  #atlantico
santader <- c("Santander", 7.12539, -73.1198, 12)
quindio <- c("Quindío", 4.535, -75.67569, 23)
tolima <- c("Tolima", 4.43889, -75.2322235, 14)
santa_marta <- c("Santa Marta D.T. y C.", 11.24079, -74.19904, 12)
cesar <- c("Cesar", 10.46314, -73.25322, 16)
san_andres <- c("San Andrés", 12.5847197, -81.7005615, 2)
casanare <- c("Casanare", 5.33775, -72.39586, 2)
narino <- c("Nariño", 1.21361, -77.28111, 6)
boyaca <- c("Boyacá", 5.767222, -72.940651, 6)
cordoba <- c("Córdoba", 8.74798, -75.88143, 2)
bolivar <- c("Bolívar", 10.39972, -75.51444, 3)
sucre <- c("Sucre", 9.3047199, -75.3977814, 1)
guajira <- c("La Guajira", 11.54444, -72.90722, 1)

gis_data <- data.frame(name = "Bogotá D.C.", latitude = 4.624335, longitude = -74.063644, 
    cases = 542)  #bogotà
gis_data$name <- as.character(gis_data$name)
# Append each department (in the same order as before) in one loop instead of
# repeated rbind calls
for (dep in list(cauca, valle_de_cauca, antioquia, cartagena, huila, meta, 
    risaralda, norte_santander, caldas, cudinamarca, barraquilla, santader, 
    quindio, tolima, santa_marta, cesar, san_andres, casanare, narino, boyaca, 
    cordoba, bolivar, sucre, guajira)) {
    gis_data <- rbind(gis_data, dep)
}

gis_data$latitude <- as.numeric(gis_data$latitude)
gis_data$longitude <- as.numeric(gis_data$longitude)
gis_data$cases <- as.numeric(gis_data$cases)

The color of the pins is related to the number of cases: if there are fewer than 10 the color is “green”, if there are fewer than 100 the color is “orange”, otherwise it is “red”.
The map shows all the cities/departments for which we have data. We can notice that we don’t have any data in the south of the country.
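The pin-color rule above can be sketched as a small helper function (the name pin_color is just illustrative; the actual map would be drawn with a mapping package such as leaflet, not shown here):

```r
# Hypothetical helper mirroring the color rule described above:
# fewer than 10 cases -> green, fewer than 100 -> orange, otherwise red
pin_color <- function(cases) {
    ifelse(cases < 10, "green", ifelse(cases < 100, "orange", "red"))
}

pin_color(c(2, 35, 542))  # "green" "orange" "red"
```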

Reading here and there, I found that Colombia is divided into 5 regions; the central one comprises Boyacá, Tolima, Cundinamarca, Meta and Bogotá D.C.

ANGELA: Looking at Wikipedia, I think that the Orinoquía Region (Meta, Arauca, Casanare and Vichada Departments) is in the center, so I would also add Arauca, Casanare and Vichada. I noticed that we only have Casanare; the other two don’t have data.

GAIA: added Casanare, for the others we have no data!

However, since in our assignment Colombia is divided into 3 parts, I think that we should add some more regions (e.g. Quindío, Valle del Cauca, Risaralda, Caldas, Boyacá and possibly Antioquia and Santander).

Some very basic plots

Let’s check the situation (and also the power of ggplot)!

Scattered info about the pandemic in Colombia (https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Colombia):

  • the quarantine started on the 20th of March; since our data run from the 6th of March to the 2nd of April, it is very likely that quarantine effects are not visible in our data.

  • on March 26th the machine that prepared the samples for processing and subsequent diagnosis of COVID-19 was damaged, which affected the speed at which results were being produced. This could explain the very low number of confirmed cases on that day.


Most of the cases are in the capital, Bogotá.


The previous plot represents the daily incidence of the disease across all the departments we are taking into account.

Let’s check the general trend by looking at the cumulative number of confirmed cases (again, all “our” departments are taken into account):

##    Fecha de diagnóstico Cumulative confirmed
## 1            2020-03-06                    1
## 2            2020-03-09                    3
## 3            2020-03-11                    8
## 4            2020-03-12                   10
## 5            2020-03-13                   13
## 6            2020-03-14                   21
## 7            2020-03-15                   34
## 8            2020-03-16                   46
## 9            2020-03-17                   58
## 10           2020-03-18                   82
## 11           2020-03-19                   99
## 12           2020-03-20                  144
## 13           2020-03-21                  169
## 14           2020-03-22                  197
## 15           2020-03-23                  259
## 16           2020-03-24                  356
## 17           2020-03-25                  404
## 18           2020-03-26                  414
## 19           2020-03-27                  457
## 20           2020-03-28                  517
## 21           2020-03-29                  593
## 22           2020-03-30                  678
## 23           2020-03-31                  758
## 24           2020-04-01                  899
## 25           2020-04-02                  993

Here the growth seems exponential (and this is consistent with the fact that we are studying the early stages of the outbreak).

In order to confirm this we should fit a log-linear model and check that it produces a constant growth rate (a straight line on the log scale).
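As a quick sketch (using the cumulative counts tabulated above), we can fit a linear model to the log of the cumulative cases; the slope estimates the daily exponential growth rate, and a good fit means the log counts lie on a straight line:

```r
# Cumulative confirmed cases and elapsed days, as tabulated above
elapsed <- c(0, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
             20, 21, 22, 23, 24, 25, 26, 27)
cumulative <- c(1, 3, 8, 10, 13, 21, 34, 46, 58, 82, 99, 144, 169, 197, 259,
                356, 404, 414, 457, 517, 593, 678, 758, 899, 993)

# Log-linear fit: log(y) = a + r * t, where r is the constant growth rate
fit <- lm(log(cumulative) ~ elapsed)
coef(fit)["elapsed"]  # estimated daily growth rate
```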

Now let’s explore the distribution of cases across genders and age:


Maybe, in order to study the distribution of ages, we should divide the ages into groups, for example 0-18, 18-30, 30-45, 45-60, 60-75, 75+.
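A minimal sketch of such a binning with base R’s cut(), using the break points suggested above (the exact boundary convention, here left-closed intervals, is an assumption):

```r
# Bin ages into the groups suggested above; right = FALSE makes the
# intervals [0,18), [18,30), ..., [75, Inf)
age_group <- function(age) {
    cut(age,
        breaks = c(0, 18, 30, 45, 60, 75, Inf),
        labels = c("0-18", "18-30", "30-45", "45-60", "60-75", "75+"),
        right = FALSE)
}

age_group(c(10, 25, 80))  # 0-18, 18-30, 75+
```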


This is quite surprising... I expected older people to be more affected by Covid-19!

The overall life expectancy in Colombia at birth is 74.8 years (71.2 years for males and 78.4 years for females), according to Wikipedia.

Instead, the median age of the population in 2015 was 29.5 years (30.4 in 2018, 31.3 in 2020), so it is a country full of young people!

Now we can analyze jointly the distribution of age and sex (sex distribution across group of age):


There isn’t much difference between the sexes among the different age groups; I have the impression that the covariates present in the dataset won’t help us!! :(

We are now left to explore the Tipo variable:


I think that en estudio means that it is not clear whether the case is imported or not; however, it seems like there are more imported cases. We can count them:

## # A tibble: 3 x 3
##   tipo        total_number percentage
##   <chr>              <int> <chr>     
## 1 Relacionado          291 29.3%     
## 2 Importado            467 47.0%     
## 3 En estudio           235 23.7%

Almost half of the total confirmed cases in our region of interest are imported, and for a significant percentage it is unknown whether they are imported or not. Again this is in some sense interesting, but I don’t see clearly why this should be helpful in our model!

## # A tibble: 55 x 2
## # Groups:   País de procedencia [55]
##    `País de procedencia`     n
##    <chr>                 <int>
##  1 0                         1
##  2 Alemania                  4
##  3 Alemania - Estambul       1
##  4 Arabia                    1
##  5 Argentina                 2
##  6 Aruba                     2
##  7 Bélgica                   1
##  8 Brasil                   10
##  9 Canadá                    1
## 10 Chile                     2
## # ... with 45 more rows

Here the data are a bit dirty; however, I don’t know if the effort of cleaning them will be worth the result. It depends on whether we decide to use this info in our analysis.

Analyze the epidemic curve separately for each department, considering only those that have more than 30 cases:

## # A tibble: 83 x 3
##    date       dep                 n
##    <date>     <chr>           <int>
##  1 2020-03-06 Bogotá D.C.         1
##  2 2020-03-09 Antioquia           1
##  3 2020-03-09 Valle del Cauca     1
##  4 2020-03-11 Antioquia           3
##  5 2020-03-11 Bogotá D.C.         2
##  6 2020-03-12 Bogotá D.C.         2
##  7 2020-03-13 Bogotá D.C.         1
##  8 2020-03-13 Valle del Cauca     1
##  9 2020-03-14 Antioquia           3
## 10 2020-03-14 Bogotá D.C.         5
## # ... with 73 more rows

Analyze the curve of cumulative confirmed cases for those “relevant” departments:


We can see that (except for the Bogotá department) we have a lot of “missing” columns: these are the days on which no data was recorded for the corresponding department, so the cumulative number of cases is the same as on the previous day reported in the dataset. Maybe there is a way to fix this!
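One possible fix, sketched on a toy tibble (using the dplyr/tidyr stack already used elsewhere in this analysis): complete() inserts the missing dates for each department, and fill() carries the last cumulative value forward:

```r
library(dplyr)
library(tidyr)

# Toy data with a "missing" day for department A (2020-03-07)
toy <- tibble(
    dep = c("A", "A", "B"),
    date = as.Date(c("2020-03-06", "2020-03-08", "2020-03-07")),
    cumulative_dep = c(1, 4, 2)
)

filled <- toy %>%
    group_by(dep) %>%
    complete(date = seq(min(toy$date), max(toy$date), by = "day")) %>%
    fill(cumulative_dep, .direction = "down") %>%  # carry last value forward
    ungroup()

# A's cumulative count on the missing day is now 1 (same as the day before);
# days before a department's first case stay NA (could be set to 0 with replace_na)
```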

##          date             dep  n cumulative_dep elapsed_time
## 1  2020-03-06     Bogotá D.C.  1              1            0
## 2  2020-03-09       Antioquia  1              1            3
## 3  2020-03-09 Valle del Cauca  1              1            3
## 4  2020-03-11       Antioquia  3              4            5
## 5  2020-03-11     Bogotá D.C.  2              3            5
## 6  2020-03-12     Bogotá D.C.  2              5            6
## 7  2020-03-13     Bogotá D.C.  1              6            7
## 8  2020-03-13 Valle del Cauca  1              2            7
## 9  2020-03-14       Antioquia  3              7            8
## 10 2020-03-14     Bogotá D.C.  5             11            8
## 11 2020-03-15       Antioquia  1              8            9
## 12 2020-03-15     Bogotá D.C.  8             19            9
## 13 2020-03-15    Cundinamarca  1              1            9
## 14 2020-03-15       Risaralda  1              1            9
## 15 2020-03-15 Valle del Cauca  1              3            9
## 16 2020-03-16     Bogotá D.C. 11             30           10
## 17 2020-03-16    Cundinamarca  1              2           10
## 18 2020-03-17     Bogotá D.C.  9             39           11
## 19 2020-03-17 Valle del Cauca  2              5           11
## 20 2020-03-18     Bogotá D.C.  5             44           12
## 21 2020-03-18    Cundinamarca  2              4           12
## 22 2020-03-18       Risaralda  4              5           12
## 23 2020-03-18 Valle del Cauca  8             13           12
## 24 2020-03-19       Antioquia  3             11           13
## 25 2020-03-19     Bogotá D.C.  8             52           13
## 26 2020-03-19    Cundinamarca  1              5           13
## 27 2020-03-19 Valle del Cauca  1             14           13
## 28 2020-03-20       Antioquia 11             22           14
## 29 2020-03-20     Bogotá D.C. 29             81           14
## 30 2020-03-20    Cundinamarca  3              8           14
## 31 2020-03-20       Risaralda  1              6           14
## 32 2020-03-20 Valle del Cauca  1             15           14
## 33 2020-03-21       Antioquia  3             25           15
## 34 2020-03-21     Bogotá D.C.  6             87           15
## 35 2020-03-21    Cundinamarca  1              9           15
## 36 2020-03-21       Risaralda  2              8           15
## 37 2020-03-21 Valle del Cauca 11             26           15
## 38 2020-03-22       Antioquia  5             30           16
## 39 2020-03-22     Bogotá D.C.  4             91           16
## 40 2020-03-22       Risaralda  5             13           16
## 41 2020-03-22 Valle del Cauca  5             31           16
## 42 2020-03-23       Antioquia 22             52           17
## 43 2020-03-23     Bogotá D.C. 22            113           17
## 44 2020-03-23    Cundinamarca  5             14           17
## 45 2020-03-23       Risaralda  1             14           17
## 46 2020-03-23 Valle del Cauca  6             37           17
## 47 2020-03-24     Bogotá D.C. 48            161           18
## 48 2020-03-24    Cundinamarca  7             21           18
## 49 2020-03-24       Risaralda  3             17           18
## 50 2020-03-24 Valle del Cauca 29             66           18
## 51 2020-03-25       Antioquia  8             60           19
## 52 2020-03-25     Bogotá D.C. 16            177           19
## 53 2020-03-25    Cundinamarca  1             22           19
## 54 2020-03-25       Risaralda  2             19           19
## 55 2020-03-25 Valle del Cauca  5             71           19
## 56 2020-03-26     Bogotá D.C.  7            184           20
## 57 2020-03-26 Valle del Cauca  2             73           20
## 58 2020-03-27     Bogotá D.C. 38            222           21
## 59 2020-03-27    Cundinamarca  1             23           21
## 60 2020-03-28       Antioquia  7             67           22
## 61 2020-03-28     Bogotá D.C. 38            260           22
## 62 2020-03-28    Cundinamarca  2             25           22
## 63 2020-03-28 Valle del Cauca 11             84           22
## 64 2020-03-29       Antioquia 19             86           23
## 65 2020-03-29     Bogotá D.C. 33            293           23
## 66 2020-03-29       Risaralda 10             29           23
## 67 2020-03-29 Valle del Cauca  8             92           23
## 68 2020-03-30       Antioquia 10             96           24
## 69 2020-03-30     Bogotá D.C. 56            349           24
## 70 2020-03-30    Cundinamarca  4             29           24
## 71 2020-03-30 Valle del Cauca 13            105           24
## 72 2020-03-31       Antioquia  5            101           25
## 73 2020-03-31     Bogotá D.C. 41            390           25
## 74 2020-03-31    Cundinamarca  9             38           25
## 75 2020-03-31       Risaralda  6             35           25
## 76 2020-03-31 Valle del Cauca 11            116           25
## 77 2020-04-01       Antioquia  6            107           26
## 78 2020-04-01     Bogotá D.C. 82            472           26
## 79 2020-04-01    Cundinamarca  4             42           26
## 80 2020-04-01 Valle del Cauca 31            147           26
## 81 2020-04-02       Antioquia 20            127           27
## 82 2020-04-02     Bogotá D.C. 70            542           27
## 83 2020-04-02 Valle del Cauca  3            150           27

Missing

I still haven’t integrated the “other part” of the dataset, the one concerning deaths!

Ideas

For what concerns the predictive model we want to build, I think that we should start by something very simple (e.g. a (log)linear model) and take it as a baseline.

Then we build something more complex (such as a hierarchical model) and see the improvements with respect to the baseline.

If possible I would include something Bayesian, since I understood that they really like this kind of stuff!

GAIA: I start here with my stream of consciousness about the analysis!

Let’s fix some terminology (references for these are the first two links of the section “Interesting Links”):

  • a case is a person who tested positive for Covid-19;
  • the epidemic curve represents the daily incremental incidence;

Our data are tabulated by date of confirmation by lab test.

The variable we want to predict, say y (the dependent variable), is the (cumulative) number of confirmed cases; namely, we want to simulate a (hopefully plausible) epidemic trajectory. Eventually we will project future incidence.

Since our y is a discrete count variable, a linear regression is not appropriate.

The most common choice with count data is to use Poisson regression, which belongs to the GLM class of models.

\[ \ln\lambda_i = \beta_0 + \beta_1 x_i \\ y_i \sim \mathrm{Poisson}(\lambda_i) \]

I think that, as a first step, we should consider the most parsimonious model, taking only time as the independent variable. Here time can be interpreted either as the Fecha de diagnóstico attribute of our dataset, or as the number of days elapsed since the earliest day in the dataset (in this case we should derive this attribute).

Further steps will include more covariates, and a model selection phase should follow. At the moment we are basically working with time series data.

Important: recall that by modelling count data with a Poisson regression we are assuming equidispersion (the mean coincides with the variance). However, we have no guarantee that this is true for our data; we need to take this into account!
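A quick way to check this assumption after fitting is the Pearson dispersion statistic (sum of squared Pearson residuals over the residual degrees of freedom), which should be close to 1 under equidispersion. A sketch on simulated data that really is Poisson, for illustration:

```r
set.seed(42)

# Simulated Poisson data with a log-linear mean, just to show the check
x <- 1:100
y_sim <- rpois(100, lambda = exp(1 + 0.02 * x))

fit_sim <- glm(y_sim ~ x, family = poisson)

# Pearson dispersion statistic: close to 1 when equidispersion holds,
# much larger under overdispersion
dispersion <- sum(residuals(fit_sim, type = "pearson")^2) / fit_sim$df.residual
dispersion
```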

The days 7/03/20, 8/03/20 and 10/03/20 are missing, probably because no cases were detected on those days (consistent with the fact that it was the very beginning of the outbreak).

##          date   y elapsed_time
## 1  2020-03-06   1            0
## 2  2020-03-09   3            3
## 3  2020-03-11   8            5
## 4  2020-03-12  10            6
## 5  2020-03-13  13            7
## 6  2020-03-14  21            8
## 7  2020-03-15  34            9
## 8  2020-03-16  46           10
## 9  2020-03-17  58           11
## 10 2020-03-18  82           12
## 11 2020-03-19  99           13
## 12 2020-03-20 144           14
## 13 2020-03-21 169           15
## 14 2020-03-22 197           16
## 15 2020-03-23 259           17
## 16 2020-03-24 356           18
## 17 2020-03-25 404           19
## 18 2020-03-26 414           20
## 19 2020-03-27 457           21
## 20 2020-03-28 517           22
## 21 2020-03-29 593           23
## 22 2020-03-30 678           24
## 23 2020-03-31 758           25
## 24 2020-04-01 899           26
## 25 2020-04-02 993           27

Description of variables

ID de caso. ID of the confirmed case.

Fecha de diagnóstico. Date that the disease was diagnosed.

Ciudad de ubicación. City where the case was diagnosed.

Departamento o Distrito. Department or district that the city belongs to.

Atención. Situation of the patient: recovered, at home, at the hospital, at the ICU or deceased.

Edad. Age of the confirmed case.

Sexo. Sex of the confirmed case.

Tipo. How the person got infected: in Colombia, abroad or unknown.

País de procedencia. Country of origin in case the person got infected abroad.

Country of origin

Cleaning the “País de procedencia” variable and creating another variable for the continents. As País de procedencia is too fragmented, it is good to group the countries into continents and see if there is a difference in the model.

We clean the column País de procedencia by making sure that there are no cities instead of countries and that multiple countries are separated with a dash (we’ll use this when creating dummy variables).

##  [1] "0"                                              
##  [2] "Alemania"                                       
##  [3] "Alemania - Estambul"                            
##  [4] "Arabia"                                         
##  [5] "Argentina"                                      
##  [6] "Aruba"                                          
##  [7] "Bélgica"                                        
##  [8] "Brasil"                                         
##  [9] "Canadá"                                         
## [10] "Chile"                                          
## [11] "Colombia"                                       
## [12] "Costa Rica"                                     
## [13] "Croacia"                                        
## [14] "Cuba"                                           
## [15] "Ecuador"                                        
## [16] "Egipto"                                         
## [17] "Emiratos Árabes"                                
## [18] "España"                                         
## [19] "España - Croacia"                               
## [20] "España - Croacia - Bosnia"                      
## [21] "España - Egipto"                                
## [22] "España - Francia"                               
## [23] "España - India"                                 
## [24] "España - Italia"                                
## [25] "España - Turquia"                               
## [26] "Estados Unidos"                                 
## [27] "Europa"                                         
## [28] "Francia"                                        
## [29] "Francia - Holanda"                              
## [30] "Grecia"                                         
## [31] "Guatemala"                                      
## [32] "Inglaterra"                                     
## [33] "Isla Martin - Caribe"                           
## [34] "Islas San Martin"                               
## [35] "Israel Egipto"                                  
## [36] "Italia"                                         
## [37] "Italia - España - Turquía"                      
## [38] "Italia - Jamaica - Panamá"                      
## [39] "Italia - Ucrania - España"                      
## [40] "Jamaica"                                        
## [41] "Jamaica - Isla Caimán - Panamá"                 
## [42] "Jamaica - Panamá - Isla Caimán"                 
## [43] "Jamaica - Panamá - Islas del caribe - Cartagena"
## [44] "Londres"                                        
## [45] "Madrid"                                         
## [46] "Marruecos"                                      
## [47] "México"                                         
## [48] "Panamá"                                         
## [49] "Panamá - Jamaica"                               
## [50] "Perú"                                           
## [51] "Puerto Rico"                                    
## [52] "República Dominicana"                           
## [53] "Suiza"                                          
## [54] "Turquía"                                        
## [55] "Turquía - Grecia"                               
## [56] "Venezuela"
## [1] 911
## # A tibble: 1 x 9
##   `ID de caso` `Fecha de diagn~ `Ciudad de ubic~ `Departamento o~
##          <int> <date>           <chr>            <chr>           
## 1          911 2020-04-02       Medellín         Antioquia       
## # ... with 5 more variables: `Atención**` <chr>, Edad <dbl>, Sexo <chr>,
## #   tipo <chr>, `País de procedencia` <chr>
# Standardizing the country names
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == 
    "Isla Martin - Caribe"] <- "Islas San Martin"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == 
    "Israel Egipto"] <- "Israel - Egipto"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == 
    "Jamaica - Isla Caimán - Panamá"] <- "Jamaica - Panamá - Isla Caimán"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == 
    "Madrid"] <- "España"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == 
    "Londres"] <- "Inglaterra"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == 
    "Alemania - Estambul"] <- "Alemania - Turquía"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == 
    "Jamaica - Panamá - Islas del caribe - Cartagena"] <- "Jamaica - Panamá - Colombia"

# Creating continents
Europa <- c("Alemania", "Bélgica", "Europa", "Croacia", "España", "España - Croacia", 
    "España - Croacia - Bosnia", "España - Francia", "España - Italia", "Francia", 
    "Francia - Holanda", "Grecia", "Inglaterra", "Italia", "Italia - Ucrania - España", 
    "Suiza")
Asia <- c("Arabia", "Emiratos Árabes", "Turquía")
África <- c("Egipto", "Marruecos")
Norteamérica <- c("Canadá", "Estados Unidos", "México")
Centroamérica <- c("Aruba", "Costa Rica", "Cuba", "Guatemala", "Islas San Martin", 
    "Jamaica", "Jamaica - Panamá - Isla Caimán", "Jamaica - Panamá - Islas del caribe - Cartagena", 
    "Panamá", "Panamá - Jamaica", "Puerto Rico", "República Dominicana")
Sudamerica <- c("Argentina", "Brasil", "Chile", "Ecuador", "Perú", "Venezuela")
# Alemania - Turquía', 'España - India', 'España - Turquia', 'Italia -
# España - Turquía', 'Turquía - Grecia' -> 'Europa - Asia' 'España - Egipto'
# -> 'Europa - África' 'Israel - Egipto' -> 'Asia - África' 'Italia -
# Jamaica - Panamá' -> 'Europa - Centroamérica' 'Colombia' -> 'Colombia'

for (i in 1:nrow(colombia_covid)) {
    if (colombia_covid$`País de procedencia`[i] %in% Europa) {
        colombia_covid$`Continente de procedencia`[i] <- "Europa"
    } else if (colombia_covid$`País de procedencia`[i] %in% Asia) {
        colombia_covid$`Continente de procedencia`[i] <- "Asia"
    } else if (colombia_covid$`País de procedencia`[i] %in% África) {
        colombia_covid$`Continente de procedencia`[i] <- "África"
    } else if (colombia_covid$`País de procedencia`[i] %in% Norteamérica) {
        colombia_covid$`Continente de procedencia`[i] <- "Norteamérica"
    } else if (colombia_covid$`País de procedencia`[i] %in% Centroamérica) {
        colombia_covid$`Continente de procedencia`[i] <- "Centroamérica"
    } else if (colombia_covid$`País de procedencia`[i] %in% Sudamerica) {
        colombia_covid$`Continente de procedencia`[i] <- "Sudamerica"
    } else if (colombia_covid$`País de procedencia`[i] == "Colombia") {
        colombia_covid$`Continente de procedencia`[i] <- "Colombia"
    } else if (colombia_covid$`País de procedencia`[i] == "Alemania - Turquía") {
        colombia_covid$`Continente de procedencia`[i] <- "Europa - Asia"
    } else if (colombia_covid$`País de procedencia`[i] == "España - India") 
        colombia_covid$`Continente de procedencia`[i] <- "Europa - Asia" else if (colombia_covid$`País de procedencia`[i] == "España - Turquia") 
        colombia_covid$`Continente de procedencia`[i] <- "Europa - Asia" else if (colombia_covid$`País de procedencia`[i] == "Italia - España - Turquía") 
        colombia_covid$`Continente de procedencia`[i] <- "Europa - Asia" else if (colombia_covid$`País de procedencia`[i] == "Turquía - Grecia") 
        colombia_covid$`Continente de procedencia`[i] <- "Europa - Asia" else if (colombia_covid$`País de procedencia`[i] == "España - Egipto") 
        colombia_covid$`Continente de procedencia`[i] <- "Europa - África" else if (colombia_covid$`País de procedencia`[i] == "Israel - Egipto") 
        colombia_covid$`Continente de procedencia`[i] <- "Asia - África" else if (colombia_covid$`País de procedencia`[i] == "Italia - Jamaica - Panamá") 
        colombia_covid$`Continente de procedencia`[i] <- "Europa - Centroamérica"
}
# Transforming the "0" value into a missing value (the column is character,
# so we match the string "0")
library(naniar)
colombia_covid <- colombia_covid %>% replace_with_na(replace = list(`País de procedencia` = "0"))
# colombia_covid[911,]$`País de procedencia` <- NA

Preparation of the dataset to run the models

GIULLIA: So, as we have to group the data by date to be able to run the models, we have to count the occurrences of the categorical variables on each day. That’s what I did below, describing each step in detail.

Dropping the variables Ciudad de ubicación, Atención** and Tipo*.

Considering the variable “Ciudad de ubicación”: I decided to keep just the department variable, to focus the analysis on dividing the region into departments rather than cities.

Considering the variable "Atención**": at the moment I think it is not relevant for predicting how many confirmed cases there will be, but we can add it later and check.

Considering the variable "Tipo*": many observations are “in study”, so it is better to use just the variable country of origin (País de procedencia), because the ones “in study” are considered as Colombia.

Transforming the age values into age ranges, to be able to transform them into dummy variables

Transform the categorical variables into dummy variables
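A minimal sketch of the dummy-variable step with base R’s model.matrix() on a toy data frame (the variable name is just illustrative):

```r
# Toy categorical data: one indicator column per level; the "- 1" removes
# the intercept so both levels (SexoF, SexoM) appear as columns
toy <- data.frame(Sexo = c("F", "M", "F"))
dummies <- model.matrix(~ Sexo - 1, data = toy)
colSums(dummies)  # SexoF: 2, SexoM: 1
```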

Counting the occurrences of the dummy variables by date and deleting Fecha de diagnóstico and ID de caso, because I will bind this with the cumulative2 table made by Gaia.

Running Poisson

Running a Poisson regression with just the variable representing time as the predictor

## 
## Call:
## glm(formula = y ~ elapsed_time, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.7629  -4.0644  -0.8324   1.5642   6.4039  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.461377   0.052885   46.54   <2e-16 ***
## elapsed_time 0.169656   0.002346   72.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7859.81  on 24  degrees of freedom
## Residual deviance:  280.62  on 23  degrees of freedom
## AIC: 446.83
## 
## Number of Fisher Scoring iterations: 4

The results show that the time variable is highly significant.



According to the plot there is overdispersion, because the standardized residuals fall well outside the -1 to 1 range assumed by a Poisson model. In addition, the residuals themselves appear to follow a Poisson-like shape. This is surprising and I don't know what it means; I have to research it.

## [1] 11.05917

Extremely high compared to the value of 1 assumed by the Poisson model!
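The dispersion statistic printed above can be computed from the Pearson residuals; a sketch, assuming the fitted model is stored in `poisson1`:

```r
# Sum of squared Pearson residuals over the residual degrees of
# freedom; this is ~1 under a well-specified Poisson model
dispersion <- sum(residuals(poisson1, type = "pearson")^2) /
  poisson1$df.residual
dispersion  # about 11 here, far above 1
```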

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -4.7629 -4.0644 -0.8324 -0.6803  1.5642  6.4039
## [1] "Null deviance: 7859.81"    "Residual deviance: 280.62"

Stan model just to compare

## 
## SAMPLING FOR MODEL 'poisson_regression' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.152 seconds (Warm-up)
## Chain 1:                0.149 seconds (Sampling)
## Chain 1:                0.301 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'poisson_regression' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 0 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.216 seconds (Warm-up)
## Chain 2:                0.138 seconds (Sampling)
## Chain 2:                0.354 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'poisson_regression' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 0 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.18 seconds (Warm-up)
## Chain 3:                0.103 seconds (Sampling)
## Chain 3:                0.283 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'poisson_regression' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 0 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 0.167 seconds (Warm-up)
## Chain 4:                0.092 seconds (Sampling)
## Chain 4:                0.259 seconds (Total)
## Chain 4:
## Inference for Stan model: poisson_regression.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##       mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## alpha 2.46       0 0.05 2.35 2.42 2.46 2.49  2.56   476 1.01
## beta  0.17       0 0.00 0.17 0.17 0.17 0.17  0.17   507 1.01
## 
## Samples were drawn using NUTS(diag_e) at Thu Jul 02 06:00:20 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
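For reference, the Stan program behind this output is likely of the following form. This is a sketch: the actual `.stan` file is not shown in the report, and the priors are assumptions.

```stan
// poisson_regression.stan (sketch; priors are assumptions)
data {
  int<lower=0> N;
  vector[N] elapsed_time;
  int<lower=0> y[N];
}
parameters {
  real alpha;
  real beta;
}
model {
  alpha ~ normal(0, 10);  // weakly informative prior (assumed)
  beta ~ normal(0, 10);
  // poisson_log applies the log link, matching glm(family = poisson)
  y ~ poisson_log(alpha + beta * elapsed_time);
}
```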

Let’s apply a quasi-Poisson model and see what happens

## 
## Call:
## glm(formula = y ~ elapsed_time, family = quasipoisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.7629  -4.0644  -0.8324   1.5642   6.4039  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.461377   0.175873   13.99 9.69e-13 ***
## elapsed_time 0.169656   0.007801   21.75  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 11.0593)
## 
##     Null deviance: 7859.81  on 24  degrees of freedom
## Residual deviance:  280.62  on 23  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

## [1] 0.9999886

So I applied the quasi-Poisson model and the dispersion statistic was reduced to 1. However, the residuals still behave like a distribution; I have to understand why that is.

Running a Poisson regression with the variable representing time plus gender

## 
## Call:
## glm(formula = y ~ elapsed_time + Sexo_M, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.7833  -4.0135  -0.7867   1.4556   6.8191  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.441994   0.057694  42.327   <2e-16 ***
## elapsed_time  0.172311   0.003911  44.055   <2e-16 ***
## Sexo_M       -0.001086   0.001279  -0.849    0.396    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7859.8  on 24  degrees of freedom
## Residual deviance:  279.9  on 22  degrees of freedom
## AIC: 448.11
## 
## Number of Fisher Scoring iterations: 4

The results show that gender is not significant and thus does not change the performance of the model. We were already expecting this, since in the data visualization step gender was equally distributed across the confirmed cases.

Running a Poisson regression with the variable representing time plus the age variables

## 
## Call:
## glm(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + 
##     `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+`, family = poisson)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -4.471  -2.342  -1.151   1.225   4.598  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     2.389641   0.061467  38.877  < 2e-16 ***
## elapsed_time    0.168236   0.004012  41.936  < 2e-16 ***
## `Edad_19 a 30` -0.019522   0.005239  -3.726 0.000194 ***
## `Edad_31 a 45`  0.012599   0.002618   4.813 1.49e-06 ***
## `Edad_46 a 60`  0.007874   0.004040   1.949 0.051281 .  
## `Edad_60 a 75` -0.031843   0.005307  -6.000 1.97e-09 ***
## `Edad_76+`      0.154509   0.018747   8.242  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7859.81  on 24  degrees of freedom
## Residual deviance:  200.75  on 18  degrees of freedom
## AIC: 376.95
## 
## Number of Fisher Scoring iterations: 4

The results show that 4 out of the 5 age ranges are significant predictors, so age does have an effect on the number of confirmed cases. Furthermore, the AIC decreased compared with the model that uses only time as predictor (poisson1).

Plotting the residuals versus fit and calculating the overdispersion


## [1] 10.00153

The dispersion statistic is high, and again the residuals continue to behave like a distribution.

##           fit         lwr        upr
## 1    10.69867    9.483496   12.06955
## 2    18.44549   16.642757   20.44350
## 3    23.58488   21.414156   25.97565
## 4    30.69966   28.202107   33.41840
## 5    33.91301   31.241180   36.81334
## 6    39.38810   36.390180   42.63299
## 7    43.25180   39.624844   47.21074
## 8    58.28427   54.096571   62.79615
## 9    63.86224   59.249298   68.83432
## 10  101.87534   94.383113  109.96232
## 11   94.17614   88.682207  100.01043
## 12   98.45408   90.523298  107.07969
## 13  125.57281  117.799392  133.85919
## 14  157.72885  149.090560  166.86765
## 15  224.15214  202.446432  248.18507
## 16  276.14626  255.037191  299.00250
## 17  379.86926  354.915995  406.57693
## 18  354.95697  333.757447  377.50305
## 19  508.91939  482.062562  537.27247
## 20  496.10852  473.114007  520.22063
## 21  568.46889  534.472988  604.62715
## 22  708.40069  670.781153  748.13005
## 23  819.09022  782.791250  857.07242
## 24  917.89085  863.784736  975.38608
## 25 1059.06144 1005.617004 1115.34623
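The fitted values with intervals shown above can be obtained on the link (log) scale and back-transformed; a sketch, assuming the age model is stored in `poisson3` and the predictor values in `newdata`:

```r
# Predictions with standard errors on the log (link) scale
pred <- predict(poisson3, newdata = newdata,
                type = "link", se.fit = TRUE)

# Approximate 95% interval on the link scale, then exponentiate
# back to the count scale
fit <- exp(pred$fit)
lwr <- exp(pred$fit - 1.96 * pred$se.fit)
upr <- exp(pred$fit + 1.96 * pred$se.fit)
data.frame(fit, lwr, upr)
```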


## 
## Call:
## glm(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + 
##     `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+`, family = quasipoisson)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -4.471  -2.342  -1.151   1.225   4.598  
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.389641   0.194392  12.293 3.42e-10 ***
## elapsed_time    0.168236   0.012687  13.260 9.95e-11 ***
## `Edad_19 a 30` -0.019522   0.016569  -1.178   0.2540    
## `Edad_31 a 45`  0.012599   0.008278   1.522   0.1454    
## `Edad_46 a 60`  0.007874   0.012776   0.616   0.5454    
## `Edad_60 a 75` -0.031843   0.016783  -1.897   0.0740 .  
## `Edad_76+`      0.154509   0.059290   2.606   0.0179 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 10.00169)
## 
##     Null deviance: 7859.81  on 24  degrees of freedom
## Residual deviance:  200.75  on 18  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

When we apply the quasi-Poisson model the age variables are no longer significant. But we know they are informative, so I think this is not a problem.


## [1] 0.9999845

By applying the quasi-Poisson model the dispersion statistic was reduced to 1, and again the residuals still behave like a distribution.

Running a Poisson regression with the variables representing time, age and departments as predictors

## 
## Call:
## glm(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + 
##     `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+` + `Departamento o Distrito_Bogotá D.C.` + 
##     `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + 
##     `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + 
##     `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + 
##     `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + 
##     `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` + 
##     `Departamento o Distrito_Valle del Cauca`, family = poisson)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.89793  -0.11401   0.01296   0.18169   0.83696  
## 
## Coefficients:
##                                           Estimate Std. Error z value
## (Intercept)                                0.73607    0.18896   3.895
## elapsed_time                               0.28751    0.02018  14.245
## `Edad_19 a 30`                             0.05362    0.02475   2.166
## `Edad_31 a 45`                             0.08583    0.02787   3.080
## `Edad_46 a 60`                             0.02671    0.03566   0.749
## `Edad_60 a 75`                             0.19705    0.15274   1.290
## `Edad_76+`                                -0.33800    0.43974  -0.769
## `Departamento o Distrito_Bogotá D.C.`     -0.10930    0.03963  -2.758
## `Departamento o Distrito_Boyacá`           0.25277    0.73715   0.343
## `Departamento o Distrito_Caldas`           0.46662    0.26875   1.736
## `Departamento o Distrito_Casanare`        -2.08700    1.82542  -1.143
## `Departamento o Distrito_Cauca`           -0.41716    0.36529  -1.142
## `Departamento o Distrito_Cundinamarca`     0.16497    0.04484   3.679
## `Departamento o Distrito_Meta`            -0.27179    0.12534  -2.168
## `Departamento o Distrito_Quindío`          0.20614    0.37399   0.551
## `Departamento o Distrito_Risaralda`       -0.38367    0.18819  -2.039
## `Departamento o Distrito_Santander`        0.41024    0.18668   2.198
## `Departamento o Distrito_Tolima`           0.68272    0.26056   2.620
## `Departamento o Distrito_Valle del Cauca` -0.13527    0.05665  -2.388
##                                           Pr(>|z|)    
## (Intercept)                               9.81e-05 ***
## elapsed_time                               < 2e-16 ***
## `Edad_19 a 30`                            0.030315 *  
## `Edad_31 a 45`                            0.002073 ** 
## `Edad_46 a 60`                            0.453829    
## `Edad_60 a 75`                            0.197009    
## `Edad_76+`                                0.442108    
## `Departamento o Distrito_Bogotá D.C.`     0.005820 ** 
## `Departamento o Distrito_Boyacá`          0.731678    
## `Departamento o Distrito_Caldas`          0.082519 .  
## `Departamento o Distrito_Casanare`        0.252915    
## `Departamento o Distrito_Cauca`           0.253451    
## `Departamento o Distrito_Cundinamarca`    0.000234 ***
## `Departamento o Distrito_Meta`            0.030127 *  
## `Departamento o Distrito_Quindío`         0.581506    
## `Departamento o Distrito_Risaralda`       0.041476 *  
## `Departamento o Distrito_Santander`       0.027984 *  
## `Departamento o Distrito_Tolima`          0.008789 ** 
## `Departamento o Distrito_Valle del Cauca` 0.016952 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7859.8137  on 24  degrees of freedom
## Residual deviance:    3.3026  on  6  degrees of freedom
## AIC: 203.51
## 
## Number of Fisher Scoring iterations: 4

The results show that including the department variables reduces the AIC substantially compared with the previous model, which had only time and age as predictors. In addition, 8 of the 12 department dummy variables are significant.


## [1] 0.5168401

Now the residuals behave differently, and the dispersion statistic has dropped considerably, falling within the range assumed by a Poisson model. Therefore there is no need to apply a quasi-Poisson model, because the assumptions of the Poisson distribution hold.

Running a Poisson regression with the variables representing time, age, departments and continent of origin as predictors

## 
## Call:
## glm(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + 
##     `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+` + `Departamento o Distrito_Bogotá D.C.` + 
##     `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + 
##     `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + 
##     `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + 
##     `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + 
##     `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` + 
##     `Departamento o Distrito_Valle del Cauca` + `Continente de procedencia_Asia` + 
##     `Continente de procedencia_Centroamérica` + `Continente de procedencia_Colombia` + 
##     `Continente de procedencia_Europa` + `Continente de procedencia_Norteamérica` + 
##     `Continente de procedencia_Sudamerica`, family = poisson)
## 
## Deviance Residuals: 
##  [1]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## [24]  0  0
## 
## Coefficients:
##                                           Estimate Std. Error z value
## (Intercept)                               -0.28738    1.72656  -0.166
## elapsed_time                               0.35366    0.09012   3.924
## `Edad_19 a 30`                             0.33992    1.29758   0.262
## `Edad_31 a 45`                             0.28653    0.73462   0.390
## `Edad_46 a 60`                             0.27472    1.30079   0.211
## `Edad_60 a 75`                             0.14618    1.61963   0.090
## `Edad_76+`                                 1.20137    4.83430   0.249
## `Departamento o Distrito_Bogotá D.C.`     -0.01024    0.66849  -0.015
## `Departamento o Distrito_Boyacá`          -1.76609    5.65082  -0.313
## `Departamento o Distrito_Caldas`          -1.97829    8.25311  -0.240
## `Departamento o Distrito_Casanare`         5.30073   26.46443   0.200
## `Departamento o Distrito_Cauca`            1.17006    5.91217   0.198
## `Departamento o Distrito_Cundinamarca`    -0.11854    1.10533  -0.107
## `Departamento o Distrito_Meta`            -0.09535    0.59699  -0.160
## `Departamento o Distrito_Quindío`         -1.00160    3.58050  -0.280
## `Departamento o Distrito_Risaralda`        0.17522    2.74291   0.064
## `Departamento o Distrito_Santander`        0.12443    3.55810   0.035
## `Departamento o Distrito_Tolima`           0.57417    3.82009   0.150
## `Departamento o Distrito_Valle del Cauca` -0.15165    0.63455  -0.239
## `Continente de procedencia_Asia`          -0.99501    2.85962  -0.348
## `Continente de procedencia_Centroamérica` -0.05926    0.59201  -0.100
## `Continente de procedencia_Colombia`      -0.31027    1.63761  -0.189
## `Continente de procedencia_Europa`        -0.04230    1.34447  -0.031
## `Continente de procedencia_Norteamérica`  -0.08012    0.54016  -0.148
## `Continente de procedencia_Sudamerica`    -0.63692    1.26602  -0.503
##                                           Pr(>|z|)    
## (Intercept)                                  0.868    
## elapsed_time                               8.7e-05 ***
## `Edad_19 a 30`                               0.793    
## `Edad_31 a 45`                               0.697    
## `Edad_46 a 60`                               0.833    
## `Edad_60 a 75`                               0.928    
## `Edad_76+`                                   0.804    
## `Departamento o Distrito_Bogotá D.C.`        0.988    
## `Departamento o Distrito_Boyacá`             0.755    
## `Departamento o Distrito_Caldas`             0.811    
## `Departamento o Distrito_Casanare`           0.841    
## `Departamento o Distrito_Cauca`              0.843    
## `Departamento o Distrito_Cundinamarca`       0.915    
## `Departamento o Distrito_Meta`               0.873    
## `Departamento o Distrito_Quindío`            0.780    
## `Departamento o Distrito_Risaralda`          0.949    
## `Departamento o Distrito_Santander`          0.972    
## `Departamento o Distrito_Tolima`             0.881    
## `Departamento o Distrito_Valle del Cauca`    0.811    
## `Continente de procedencia_Asia`             0.728    
## `Continente de procedencia_Centroamérica`    0.920    
## `Continente de procedencia_Colombia`         0.850    
## `Continente de procedencia_Europa`           0.975    
## `Continente de procedencia_Norteamérica`     0.882    
## `Continente de procedencia_Sudamerica`       0.615    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7.8598e+03  on 24  degrees of freedom
## Residual deviance: 3.2641e-13  on  0  degrees of freedom
## AIC: 212.2
## 
## Number of Fisher Scoring iterations: 3

The AIC actually increased when the continent-of-origin variables were added, and none of them is significant. Note that with 25 coefficients for 25 observations the model is saturated (zero residual degrees of freedom and all deviance residuals equal to zero), so these estimates are not reliable.

## 
## Call:
## glm(formula = y ~ elapsed_time + `Continente de procedencia_Asia` + 
##     `Continente de procedencia_Centroamérica` + `Continente de procedencia_Colombia` + 
##     `Continente de procedencia_Europa` + `Continente de procedencia_Norteamérica` + 
##     `Continente de procedencia_Sudamerica`, family = poisson)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -4.178  -3.466  -1.204   1.988   6.397  
## 
## Coefficients:
##                                             Estimate Std. Error z value
## (Intercept)                                2.2264438  0.0784497  28.381
## elapsed_time                               0.1774725  0.0040070  44.291
## `Continente de procedencia_Asia`          -0.0372645  0.0099577  -3.742
## `Continente de procedencia_Centroamérica` -0.0040363  0.0065343  -0.618
## `Continente de procedencia_Colombia`       0.0004037  0.0010424   0.387
## `Continente de procedencia_Europa`         0.0118714  0.0054906   2.162
## `Continente de procedencia_Norteamérica`  -0.0073427  0.0052240  -1.406
## `Continente de procedencia_Sudamerica`     0.0264740  0.0114359   2.315
##                                           Pr(>|z|)    
## (Intercept)                                < 2e-16 ***
## elapsed_time                               < 2e-16 ***
## `Continente de procedencia_Asia`          0.000182 ***
## `Continente de procedencia_Centroamérica` 0.536766    
## `Continente de procedencia_Colombia`      0.698530    
## `Continente de procedencia_Europa`        0.030606 *  
## `Continente de procedencia_Norteamérica`  0.159849    
## `Continente de procedencia_Sudamerica`    0.020613 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7859.81  on 24  degrees of freedom
## Residual deviance:  215.47  on 17  degrees of freedom
## AIC: 393.67
## 
## Number of Fisher Scoring iterations: 4

In this case the AIC decreased compared with the model that uses just time as predictor (poisson1), but only 3 of the 6 continent variables are significant, so we think it is not a relevant enough predictor to include in the model with the lowest AIC so far (poisson4).

Applying ANOVA to compare the Poisson models

We decided to compare poisson1quasi, poisson3quasi and poisson4, because they are nested and we are mainly interested in whether model 4 is in fact better than the first model.
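A sketch of the comparison, assuming the fitted models are stored under these names:

```r
# Deviance-based comparison of the nested models;
# test = "Chisq" adds chi-squared p-values to the table
anova(poisson1quasi, poisson3quasi, poisson4, test = "Chisq")
```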

## Analysis of Deviance Table
## 
## Model 1: y ~ elapsed_time
## Model 2: y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + `Edad_46 a 60` + 
##     `Edad_60 a 75` + `Edad_76+`
## Model 3: y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + `Edad_46 a 60` + 
##     `Edad_60 a 75` + `Edad_76+` + `Departamento o Distrito_Bogotá D.C.` + 
##     `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + 
##     `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + 
##     `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + 
##     `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + 
##     `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` + 
##     `Departamento o Distrito_Valle del Cauca`
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        23    280.624                          
## 2        18    200.747  5   79.878 8.901e-16 ***
## 3         6      3.303 12  197.444 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The poisson4 model is the best; it has the lowest p-value.

Two important observations for Angela.

  • Is it ok to compare a Poisson model with a quasi-Poisson model in the same table?

  • The problem I mentioned is that, as you can see, the table gives only the p-value and is missing the other test statistics. If you go to the comparison of the negative binomial models you will see that it works properly there. It is something related to the Poisson model on our data: comparing only Poisson models, or only quasi-Poisson models, gives the same problem.

Running Negative Binomial

Running a negative binomial regression with just the variable representing time as predictor

## 
## Call:
## glm.nb(formula = y ~ elapsed_time, init.theta = 11.25073154, 
##     link = log)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.81454  -1.19152  -0.02362   0.81855   1.41660  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.518348   0.169272    8.97   <2e-16 ***
## elapsed_time 0.219689   0.009522   23.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(11.2507) family taken to be 1)
## 
##     Null deviance: 477.963  on 24  degrees of freedom
## Residual deviance:  27.849  on 23  degrees of freedom
## AIC: 259.91
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  11.25 
##           Std. Err.:  3.87 
## 
##  2 x log-likelihood:  -253.908

The predictor is significant and we get an AIC considerably lower than the Poisson model run with the same predictor.


The residuals exceed the -1 to 1 range only at the lower bound, by approximately one unit.

## [1] 1.33484

The dispersion statistic is a bit higher than 1, but I think it is still an acceptable number. Gioia said that up to values close to 2 the overdispersion can still be considered acceptable.
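The negative binomial fits in this section use `MASS::glm.nb`, which estimates the dispersion parameter theta alongside the coefficients; a sketch for the first model, assuming `y` and `elapsed_time` are in the workspace as in the Poisson fits:

```r
library(MASS)

# theta is estimated from the data rather than fixed, which
# absorbs the overdispersion seen in the Poisson fits
nb1 <- glm.nb(y ~ elapsed_time, link = log)
summary(nb1)
```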

Running a negative binomial regression with the variable representing time plus the age variables as predictors

## 
## Call:
## glm.nb(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + 
##     `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+`, init.theta = 12.57832368, 
##     link = log)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.82305  -1.19078  -0.05259   0.56987   1.86010  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     1.487607   0.194140   7.663 1.82e-14 ***
## elapsed_time    0.225048   0.017409  12.927  < 2e-16 ***
## `Edad_19 a 30`  0.005567   0.023328   0.239    0.811    
## `Edad_31 a 45` -0.002274   0.014144  -0.161    0.872    
## `Edad_46 a 60`  0.002383   0.022039   0.108    0.914    
## `Edad_60 a 75` -0.032190   0.027931  -1.152    0.249    
## `Edad_76+`      0.061832   0.087671   0.705    0.481    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(12.5783) family taken to be 1)
## 
##     Null deviance: 526.498  on 24  degrees of freedom
## Residual deviance:  28.148  on 18  degrees of freedom
## AIC: 267.94
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  12.58 
##           Std. Err.:  4.41 
## 
##  2 x log-likelihood:  -251.944

The AIC increased compared with the previous model (nb1), and the age variables are not significant. That is odd; in the Poisson model they are.

Running a negative binomial regression with the variable representing time plus the department variables as predictors

## 
## Call:
## glm.nb(formula = y ~ elapsed_time + `Departamento o Distrito_Bogotá D.C.` + 
##     `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + 
##     `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + 
##     `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + 
##     `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + 
##     `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` + 
##     `Departamento o Distrito_Valle del Cauca`, init.theta = 82.27821159, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5375  -0.9716  -0.1456   0.7481   1.9049  
## 
## Coefficients:
##                                            Estimate Std. Error z value
## (Intercept)                                1.038762   0.177075   5.866
## elapsed_time                               0.275143   0.016572  16.603
## `Departamento o Distrito_Bogotá D.C.`     -0.019166   0.004033  -4.752
## `Departamento o Distrito_Boyacá`          -0.447171   0.140811  -3.176
## `Departamento o Distrito_Caldas`          -0.008610   0.086957  -0.099
## `Departamento o Distrito_Casanare`        -0.111260   0.193131  -0.576
## `Departamento o Distrito_Cauca`            0.101655   0.048756   2.085
## `Departamento o Distrito_Cundinamarca`     0.109295   0.034207   3.195
## `Departamento o Distrito_Meta`            -0.088358   0.049785  -1.775
## `Departamento o Distrito_Quindío`         -0.036409   0.048346  -0.753
## `Departamento o Distrito_Risaralda`        0.058404   0.062973   0.927
## `Departamento o Distrito_Santander`       -0.122086   0.153665  -0.794
## `Departamento o Distrito_Tolima`           0.073910   0.093845   0.788
## `Departamento o Distrito_Valle del Cauca` -0.011913   0.012701  -0.938
##                                           Pr(>|z|)    
## (Intercept)                               4.46e-09 ***
## elapsed_time                               < 2e-16 ***
## `Departamento o Distrito_Bogotá D.C.`     2.01e-06 ***
## `Departamento o Distrito_Boyacá`           0.00149 ** 
## `Departamento o Distrito_Caldas`           0.92112    
## `Departamento o Distrito_Casanare`         0.56456    
## `Departamento o Distrito_Cauca`            0.03707 *  
## `Departamento o Distrito_Cundinamarca`     0.00140 ** 
## `Departamento o Distrito_Meta`             0.07594 .  
## `Departamento o Distrito_Quindío`          0.45140    
## `Departamento o Distrito_Risaralda`        0.35369    
## `Departamento o Distrito_Santander`        0.42691    
## `Departamento o Distrito_Tolima`           0.43094    
## `Departamento o Distrito_Valle del Cauca`  0.34825    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(82.2785) family taken to be 1)
## 
##     Null deviance: 2214.864  on 24  degrees of freedom
## Residual deviance:   26.185  on 11  degrees of freedom
## AIC: 247.44
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  82.3 
##           Std. Err.:  36.2 
## Warning while fitting theta: alternation limit reached 
## 
##  2 x log-likelihood:  -217.438

The AIC decreased compared with the first model (nb1), but only very little. In addition, just 5 of the 12 department dummy variables are significant.

Running a negative binomial regression with the variable representing time plus the continent-of-origin variables as predictors

## 
## Call:
## glm.nb(formula = y ~ elapsed_time + `Continente de procedencia_Asia` + 
##     `Continente de procedencia_Centroamérica` + `Continente de procedencia_Colombia` + 
##     `Continente de procedencia_Europa` + `Continente de procedencia_Norteamérica` + 
##     `Continente de procedencia_Sudamerica`, init.theta = 22.72556147, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5410  -0.7776   0.1004   0.4496   2.2821  
## 
## Coefficients:
##                                            Estimate Std. Error z value
## (Intercept)                                1.064167   0.206173   5.162
## elapsed_time                               0.239010   0.013619  17.550
## `Continente de procedencia_Asia`          -0.018862   0.044232  -0.426
## `Continente de procedencia_Centroamérica` -0.041791   0.029116  -1.435
## `Continente de procedencia_Colombia`      -0.007457   0.004613  -1.617
## `Continente de procedencia_Europa`         0.047182   0.017591   2.682
## `Continente de procedencia_Norteamérica`  -0.001864   0.023238  -0.080
## `Continente de procedencia_Sudamerica`    -0.018870   0.043695  -0.432
##                                           Pr(>|z|)    
## (Intercept)                               2.45e-07 ***
## elapsed_time                               < 2e-16 ***
## `Continente de procedencia_Asia`           0.66979    
## `Continente de procedencia_Centroamérica`  0.15119    
## `Continente de procedencia_Colombia`       0.10595    
## `Continente de procedencia_Europa`         0.00731 ** 
## `Continente de procedencia_Norteamérica`   0.93606    
## `Continente de procedencia_Sudamerica`     0.66585    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(22.7256) family taken to be 1)
## 
##     Null deviance: 864.028  on 24  degrees of freedom
## Residual deviance:  23.798  on 17  degrees of freedom
## AIC: 254.17
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  22.73 
##           Std. Err.:  7.59 
## Warning while fitting theta: alternation limit reached 
## 
##  2 x log-likelihood:  -236.169

The AIC decreased in comparison with the first model (nb1), but only very slightly. In addition, just one of the 6 continent dummy variables is significant.

Running a negative binomial regression with time, age and department as predictors

## 
## Call:
## glm.nb(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + 
##     `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+` + `Departamento o Distrito_Bogotá D.C.` + 
##     `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + 
##     `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + 
##     `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + 
##     `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + 
##     `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` + 
##     `Departamento o Distrito_Valle del Cauca`, init.theta = 2453700.017, 
##     link = log)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.89792  -0.11401   0.01296   0.18169   0.83694  
## 
## Coefficients:
##                                           Estimate Std. Error z value
## (Intercept)                                0.73607    0.18896   3.895
## elapsed_time                               0.28751    0.02018  14.245
## `Edad_19 a 30`                             0.05362    0.02475   2.166
## `Edad_31 a 45`                             0.08583    0.02787   3.079
## `Edad_46 a 60`                             0.02671    0.03566   0.749
## `Edad_60 a 75`                             0.19705    0.15274   1.290
## `Edad_76+`                                -0.33801    0.43975  -0.769
## `Departamento o Distrito_Bogotá D.C.`     -0.10930    0.03963  -2.758
## `Departamento o Distrito_Boyacá`           0.25278    0.73716   0.343
## `Departamento o Distrito_Caldas`           0.46662    0.26875   1.736
## `Departamento o Distrito_Casanare`        -2.08704    1.82545  -1.143
## `Departamento o Distrito_Cauca`           -0.41717    0.36530  -1.142
## `Departamento o Distrito_Cundinamarca`     0.16497    0.04484   3.679
## `Departamento o Distrito_Meta`            -0.27178    0.12534  -2.168
## `Departamento o Distrito_Quindío`          0.20615    0.37400   0.551
## `Departamento o Distrito_Risaralda`       -0.38367    0.18819  -2.039
## `Departamento o Distrito_Santander`        0.41024    0.18669   2.197
## `Departamento o Distrito_Tolima`           0.68273    0.26057   2.620
## `Departamento o Distrito_Valle del Cauca` -0.13528    0.05665  -2.388
##                                           Pr(>|z|)    
## (Intercept)                               9.81e-05 ***
## elapsed_time                               < 2e-16 ***
## `Edad_19 a 30`                            0.030317 *  
## `Edad_31 a 45`                            0.002074 ** 
## `Edad_46 a 60`                            0.453827    
## `Edad_60 a 75`                            0.197007    
## `Edad_76+`                                0.442100    
## `Departamento o Distrito_Bogotá D.C.`     0.005820 ** 
## `Departamento o Distrito_Boyacá`          0.731667    
## `Departamento o Distrito_Caldas`          0.082520 .  
## `Departamento o Distrito_Casanare`        0.252912    
## `Departamento o Distrito_Cauca`           0.253449    
## `Departamento o Distrito_Cundinamarca`    0.000234 ***
## `Departamento o Distrito_Meta`            0.030131 *  
## `Departamento o Distrito_Quindío`         0.581496    
## `Departamento o Distrito_Risaralda`       0.041478 *  
## `Departamento o Distrito_Santander`       0.027985 *  
## `Departamento o Distrito_Tolima`          0.008789 ** 
## `Departamento o Distrito_Valle del Cauca` 0.016953 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2453705) family taken to be 1)
## 
##     Null deviance: 7858.9108  on 24  degrees of freedom
## Residual deviance:    3.3026  on  6  degrees of freedom
## AIC: 205.51
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2453700 
##           Std. Err.:  24897176 
## Warning while fitting theta: alternation limit reached 
## 
##  2 x log-likelihood:  -165.509

When the age and department variables are both included in the model, the AIC decreases in comparison with the lowest-AIC model so far (nb1).


## [1] 3.853961

Here the overdispersion is a bit high. Angela, do you know whether in the negative binomial the equidispersion assumptions are the same as in the Poisson? For the Poisson they are:

  • The overdispersion statistic has to be close to one.

  • The standardized residuals have to lie roughly between -1 and 1.
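As a sketch, one common way to compute an overdispersion statistic like the one reported above is from the Pearson residuals of the fitted model; `nb5` is our assumed name for the negative binomial fit, not necessarily the name used in the original code:

```r
# Overdispersion check via Pearson residuals (a sketch; `nb5` is our
# assumed name for the fitted negative binomial model)
pearson_res <- residuals(nb5, type = "pearson")
dispersion  <- sum(pearson_res^2) / df.residual(nb5)
dispersion  # values well above 1 indicate remaining overdispersion
```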

Applying ANOVA to compare the negative binomial models

We decided to compare nb1, nb2 and nb5 because they are nested, and we are mainly interested in whether nb5 is in fact better than the first model.
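Assuming the three fits are stored as `glm.nb` objects named `nb1`, `nb2` and `nb5` (the names are our assumption), the likelihood-ratio table below can be produced with `anova()`:

```r
library(MASS)  # glm.nb and its anova method for negative binomial fits

# Likelihood-ratio tests of the nested negative binomial models
# (nb1: time only; nb2: time + age; nb5: time + age + department)
anova(nb1, nb2, nb5)
```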

## Likelihood ratio tests of Negative Binomial Models
## 
## Response: y
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         Model
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                elapsed_time
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                               elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+`
## 3 elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` + `Departamento o Distrito_Valle del Cauca`
##          theta Resid. df    2 x log-lik.   Test    df  LR stat.
## 1 1.125073e+01        23       -253.9080                       
## 2 1.257832e+01        18       -251.9442 1 vs 2     5  1.963751
## 3 2.453700e+06         6       -165.5091 2 vs 3    12 86.435145
##        Pr(Chi)
## 1             
## 2 8.541378e-01
## 3 2.409184e-13

The fifth model is definitely the best: it is the one with the lowest p-value in the likelihood-ratio test.

Preliminary conclusions

In the Poisson case, the model with the lowest AIC (203.51) includes 3 predictors: time, age and department.

In the Negative Binomial case, the model with the lowest AIC (205.51) includes the same 3 predictors as the best Poisson model.

Predictive accuracy

Calculating the predictive accuracy of the best Poisson model and the best Negative Binomial model.

Predictive accuracy of the Poisson model

Predicting with a 95% confidence interval
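The exact code used is not shown; a standard way to build such an interval for a GLM with a log link is to compute it on the link scale and exponentiate (the object names `fit_pois` and `model_data` below are our assumptions, not from the original code):

```r
# 95% interval for the expected count, built on the link (log) scale
# (`fit_pois` and `model_data` are assumed names, not from the original code)
pr <- predict(fit_pois, newdata = model_data, type = "link", se.fit = TRUE)
intervals <- data.frame(
  fit = exp(pr$fit),
  lwr = exp(pr$fit - 1.96 * pr$se.fit),
  upr = exp(pr$fit + 1.96 * pr$se.fit)
)
```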

##           fit        lwr         upr
## 1    1.974648   1.332991    2.925179
## 2    4.834856   3.574338    6.539907
## 3    8.990605   6.638468   12.176149
## 4   11.180386   8.280195   15.096388
## 5   12.300336   8.491267   17.818102
## 6   19.821523  16.768624   23.430233
## 7   32.147925  23.952591   43.147276
## 8   40.554542  32.533719   50.552810
## 9   56.442432  43.971719   72.449934
## 10  79.443730  64.078012   98.494102
## 11 106.202123  89.950458  125.390034
## 12 144.907375 123.469566  170.067393
## 13 166.649056 143.604165  193.392076
## 14 196.542355 170.935542  225.985170
## 15 258.791545 229.143210  292.276013
## 16 358.155418 323.051083  397.074365
## 17 401.011234 363.835748  441.985183
## 18 417.425361 379.504684  459.135128
## 19 459.209857 419.162388  503.083528
## 20 516.458369 473.793775  562.964861
## 21 594.837208 549.008150  644.491897
## 22 675.890350 626.889835  728.720965
## 23 757.917923 705.838598  813.839850
## 24 899.457645 842.574303  960.181258
## 25 991.853197 932.047559 1055.496316

Calculating the proportion of times the true value of y falls inside the model's 95% confidence interval. (This is a measure of interval coverage, which can be read as a kind of accuracy.)
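A minimal sketch of this computation, assuming `y` holds the observed counts and `intervals` is a data frame with `lwr`/`upr` columns like the table above (both names are ours):

```r
# Empirical coverage: share of observed counts inside the 95% interval
# (`y` and `intervals` are assumed names)
coverage <- mean(y >= intervals$lwr & y <= intervals$upr)
coverage
```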

## [1] 0.92

We obtain 92% coverage! This is very good, even though it was computed on the training data, where good coverage is not guaranteed.



We have an almost perfect model!

But now let's perform a leave-one-out cross-validation to see what result we get. (Note: I did not find code to predict with a 95% interval under cross-validation, so I am just calculating the overall prediction error.)
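A sketch of the leave-one-out computation using `boot::cv.glm` (the names `model_data` and `fit_pois` are our assumptions); leaving `K` at its default, the number of observations, gives leave-one-out:

```r
library(boot)  # cv.glm

# Leave-one-out CV: cv.glm's K defaults to the number of observations
# (`model_data` and `fit_pois` are assumed names)
cv_res <- cv.glm(model_data, fit_pois)
cv_res$delta  # [1] raw CV prediction error, [2] bias-adjusted estimate
```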

## [1] 130359.2 125147.4

I am not yet able to interpret this error number; is it too high? I will look into it tomorrow. The value on the left is the raw cross-validation estimate of the prediction error, and the one on the right is the bias-adjusted estimate.

I ran leave-one-out cross-validation again with the other two Poisson models.

## [1] 6993.699 6776.927
## [1] 4144.633 4099.331

So if you compare poisson4 with these other two models, you can see that the test error actually increases as more variables are added. So should we choose the model with just time as a predictor (poisson1quasi), even though the model with many predictors produces a better AIC and statistically significant variables? I don't know what to do.

Predictive accuracy of the Negative Binomial model

Predicting with a 95% confidence interval

##           fit        lwr         upr
## 1    1.974639   1.332980    2.925175
## 2    4.834841   3.574319    6.539901
## 3    8.990565   6.638413   12.176140
## 4   11.180353   8.280153   15.096376
## 5   12.300362   8.491258   17.818195
## 6   19.821493  16.768580   23.430225
## 7   32.147934  23.952554   43.147368
## 8   40.554579  32.533683   50.552956
## 9   56.442454  43.971612   72.450166
## 10  79.443718  64.077780   98.494429
## 11 106.202242  89.950276  125.390569
## 12 144.907422 123.469042  170.068227
## 13 166.648983 143.603391  193.392951
## 14 196.542333 170.934568  225.986407
## 15 258.791526 229.141725  292.277863
## 16 358.155670 323.048905  397.077600
## 17 401.010827 363.832498  441.988235
## 18 417.425852 379.502115  459.139317
## 19 459.210217 419.159167  503.088182
## 20 516.458259 473.789373  562.969852
## 21 594.837596 549.003210  644.498538
## 22 675.889823 626.882851  728.727947
## 23 757.917904 705.830820  813.848777
## 24 899.457798 842.564370  960.192905
## 25 991.852774 932.035450 1055.509129

Calculating the proportion of times the true value of y falls inside the model's 95% confidence interval. (This is a measure of interval coverage, which can be read as a kind of accuracy.)

## [1] 0.92

We also obtain 92% coverage for the Negative Binomial model.



Also a very precise model.

But now let's perform a leave-one-out cross-validation to see what result we get. (Note: I did not find code to predict with a 95% interval under cross-validation, so I am just calculating the overall prediction error.)

## [1] 130356.7 125145.0

Approximately the same result as the Poisson model.

I ran leave-one-out cross-validation again with the other two Negative Binomial models.

## [1] 119035.8 116020.8
## [1] 56611.84 56065.30

The same thing happens here as in the Poisson case. If you compare nb5 with these other two models, you can see that the test error increases as more variables are added. So again, should we choose the model with just time as a predictor (nb1), even though the model with many predictors produces a better AIC and statistically significant variables? Again, I don't know what to do.

Final conclusions

I thought that the predictive-accuracy comparison would let us choose between the Poisson and the Negative Binomial, but it did not: they perform equally well, both when predicting with the 95% confidence interval on the training data and in the leave-one-out cross-validation. And now there is the problem of overfitting. So what do we do now?